CS-466/566: Math for AI

Goes back to 1940, when people started to build models that imitate the human brain.
Logistic regression (perceptron) is the core of neural networks started in 1950.
However, scientists in that time showed that a single perceptron can not solve xor problem (died).
Reborn in 1980, discovery of merging perceptrons together. But died due to the resources requirements
Reborn in the last decade with the advancement of the computation resouces.

\(x_1\)	\(x_2\)	\(y\)
0	0	0
0	1	0
1	0	0
1	1	1

\(x_1\)	\(x_2\)	\(y\)
0	0	0
0	1	1
1	0	1
1	1	1

\(x_1\)	\(x_2\)	\(y\)
0	0	0
0	1	1
1	0	1
1	1	0

For neuron (1), the weights are \(W^1_{11}, W^1_{12}, W^1_{13}\).
For neuron (2), the weights are \(W^1_{21}, W^1_{22}, W^1_{23}\).
For neuron (3), the weights are \(W^1_{31}, W^1_{32}, W^1_{33}\).
Superscript \(1\) indicates the first layer.
Subscript \(1, 2, 3\) indicates the first, second, and third neuron in the layer.

\[\begin{align*} \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} \\ x_1^{(3)} & x_2^{(3)} & x_3^{(3)} \\ x_1^{(4)} & x_2^{(4)} & x_3^{(4)} \end{bmatrix} \times \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix} = \begin{bmatrix} h_1^{(1)} & h_2^{(1)} & h_3^{(1)} \\ h_1^{(2)} & h_2^{(2)} & h_3^{(2)} \\ h_1^{(3)} & h_2^{(3)} & h_3^{(3)} \\ h_1^{(4)} & h_2^{(4)} & h_3^{(4)} \end{bmatrix} \end{align*}\] where \(x_i^{(j)}\) is feature \(i\) of house \(j\), \(w_{i,k}\) is the weight from input \(i\) to hidden node \(k\), and \(h_k^{(j)}\) is hidden node \(k\)’s value for house \(j\)

Computational Graph [Sigmoid]

Sigmoid: it is applied to each element of the hidden matrix

\[\begin{align*} \begin{bmatrix} h_1^{(1)} & h_2^{(1)} & h_3^{(1)} \\ h_1^{(2)} & h_2^{(2)} & h_3^{(2)} \\ h_1^{(3)} & h_2^{(3)} & h_3^{(3)} \\ h_1^{(4)} & h_2^{(4)} & h_3^{(4)} \end{bmatrix} \rightarrow \begin{bmatrix} \sigma(h_1^{(1)}) & \sigma(h_2^{(1)}) & \sigma(h_3^{(1)}) \\ \sigma(h_1^{(2)}) & \sigma(h_2^{(2)}) & \sigma(h_3^{(2)}) \\ \sigma(h_1^{(3)}) & \sigma(h_2^{(3)}) & \sigma(h_3^{(3)}) \\ \sigma(h_1^{(4)}) & \sigma(h_2^{(4)}) & \sigma(h_3^{(4)}) \end{bmatrix} = \begin{bmatrix} a_1^{(1)} & a_2^{(1)} & a_3^{(1)} \\ a_1^{(2)} & a_2^{(2)} & a_3^{(2)} \\ a_1^{(3)} & a_2^{(3)} & a_3^{(3)} \\ a_1^{(4)} & a_2^{(4)} & a_3^{(4)} \end{bmatrix} \end{align*}\] where \(a_k^{(j)}\) is the value of activation node \(k\) for house \(j\)

Computational Graph [Average]

Average: it is applied to each row of the activation matrix

\[\begin{align*} \begin{bmatrix} a_1^{(1)} & a_2^{(1)} & a_3^{(1)} \\ a_1^{(2)} & a_2^{(2)} & a_3^{(2)} \\ a_1^{(3)} & a_2^{(3)} & a_3^{(3)} \\ a_1^{(4)} & a_2^{(4)} & a_3^{(4)} \end{bmatrix} \rightarrow \begin{bmatrix} \frac{a_1^{(1)} + a_2^{(1)} + a_3^{(1)}}{3} \\ \frac{a_1^{(2)} + a_2^{(2)} + a_3^{(2)}}{3} \\ \frac{a_1^{(3)} + a_2^{(3)} + a_3^{(3)}}{3} \\ \frac{a_1^{(4)} + a_2^{(4)} + a_3^{(4)}}{3} \end{bmatrix} \end{align*}\]

Every house now has an average probability of being sold predicted by three neurons.

Chain Rule of Neural Network

What is the derivative of \(O\) with respect to \(W\)?

\(\frac{\partial O}{\partial W} = \frac{\partial U}{\partial W} \odot \frac{\partial V}{\partial U} \odot \frac{\partial O}{\partial V}\)
Should be same shape as W, this symbol \(\odot\) is element-wise multiplication

If each computation node in the graph has a known, easy-to-compute local derivative, we can compute the derivative of the entire graph with respect to the weights using the Chain Rule

from typing import Union, List

Numberable = Union[float, int]

def ensure_number(num: Numberable):
    if isinstance(num, NumberWithGrad):
        return num
    else:
        return NumberWithGrad(num)        

class NumberWithGrad(object):
    def __init__(self, num: Numberable, depends_on: List[Numberable] = None, creation_op: str = ''):
        self.num = num
        self.grad = None
        self.depends_on = depends_on or []
        self.creation_op = creation_op

    def __add__(self, other: Numberable):
        return NumberWithGrad(self.num + ensure_number(other).num,
                              depends_on = [self, ensure_number(other)],
                              creation_op = 'add')
    
    def __mul__(self, other: Numberable = None):
        return NumberWithGrad(self.num * ensure_number(other).num,
                              depends_on = [self, ensure_number(other)],
                              creation_op = 'mul')

Simple Autograd Implementation [2/2]

    def backward(self, backward_grad: Numberable = None):
        if backward_grad is None: # first time calling backward
            self.grad = 1
        else: 
            # Accumulate gradient
            if self.grad is None:
                self.grad = backward_grad
            else:
                self.grad += backward_grad
        
        if self.creation_op == "add":
            # Send backward self.grad
            self.depends_on[0].backward(self.grad)
            self.depends_on[1].backward(self.grad)    

        if self.creation_op == "mul":
            # Calculate derivative w.r.t first element
            new = self.depends_on[1] * self.grad
            self.depends_on[0].backward(new.num)

            # Calculate derivative w.r.t second element
            new = self.depends_on[0] * self.grad
            self.depends_on[1].backward(new.num)

Autograd Example Usage

Let’s test our NumberWithGrad class:

a = NumberWithGrad(3)
b = a * 4
c = b + 3

c.backward()

print("Gradients:")
print("a.grad:", a.grad) # Expected: 4
print("b.grad:", b.grad) # Expected: 1

Thank You!

CS-466/566: Math for AI

TABLE OF CONTENTS

Introduction

TABLE OF CONTENTS

AND Gate Neural Network

OR Gate Neural Network

XOR Gate Neural Network

Solving XOR with a Neural Network

Neurons and the Brain

Types of Layers

Neural Network Design

Example of a Neural Network

TABLE OF CONTENTS

Neural Network As Computational Graph

Computational Graph [Matrix Multiplication]

Computational Graph [Sigmoid]

Computational Graph [Average]

Chain Rule of Neural Network

TABLE OF CONTENTS

Backward Mode

Basic Primitive Operations [Addition]

Basic Primitive Operations [Subtraction]

Basic Primitive Operations [Multiplication]

Basic Primitive Operations [Division]

Backward Mode Computational Graph Example

Derivative Example

Derivative Example

Simple Autograd Implementation [1/2]

Simple Autograd Implementation [2/2]

Autograd Example Usage

Thank You!